Many computer science textbooks on algorithms assume that your computer's memory is all the same in the sense that it is no more expensive to access any word than it is to access any other. This makes the analysis easy, but on modern computers, it is far from true. Some types of memory can be 1000 times slower to access than others, and particularly bad programming can slow down a program's running time by factors of tens or hundreds. This tutorial attempts to explain why this happens, how to avoid it, and what to watch out for in your code.
The explanations below are necessarily simplified, but it's likely that your local processor designer would be glad to tell you the gory details if you want. Fortunately, most people are able to vastly improve their code without knowing the exact details. Of course, the more you know, the better a programmer you will be.
My terminology below is a bit non-standard. I've basically referred to all levels of the memory hierarchy as caches, since in some sense, they all behave the same way, and the same principles apply when thinking about each level. Usually, only the primary and secondary caches are referred to by the name "cache".
Part 1: Hardware Reality
What is a cache?
Computers have different kinds of memory, and generally, the faster the memory is, the more expensive it is. Caching is an automatic mechanism supported by the hardware that attempts to keep data that's heavily used in memory that is very fast, and little-used data in slow memory. The general idea is that when you access a word of memory, it is copied into the cache so that if you need to access it again within a short time, it will be in the highest speed memory. Of course, there's only a small amount of the highest speed memory, so if you don't access the word for a while, it will be replaced in the cache by some other word. In addition, since memory tends to be accessed sequentially (especially machine instructions), when the computer needs to read a new word into the cache, it will often read the next few words into the cache at the same time in the hope that you will access them on successive instructions and they will already be available in the high speed memory.
The caches are shared by all the processes running on a machine, so just because you "know" that your program will always be running on a machine with, say, 2 megabytes of secondary cache, don't write code that will behave badly with less than that. You're sharing with other processes, and the more processes you share with, the less effective cache space your process will have.
How does a cache work?
The primary and secondary caches are accessed with something called content-addressable memory. Since the cache may contain memory from somewhat random addresses, both the address and the data must be saved. When the processor tries to access text or data, all the addresses currently in the cache are simultaneously tested against the requested address, and if there is a match, the corresponding data or computer instruction is returned. Notice that this is different from regular memory -- if the processor wants the data at address 10, it simply looks in the 10th slot; in the cache, any slot's address portion may hold the requested address, and hence all must be searched simultaneously. The sought-after address is basically broadcast to all the cache's address slots, and if there's a match in one of them, the corresponding data is returned.
What is the memory hierarchy?
Modern computers have a whole sequence of memories, each operating at a different speed. The sequence is called a memory hierarchy; it essentially begins with the registers, and each successive "cache" is larger and slower. As your program makes accesses to larger and slower caches, the amount of memory brought in each time increases. Here's a quick description of the different sorts of caches:
- Registers
The fastest memory. Usually, there are on the order of tens of these -- 16 or 32 of them. You can still specify "register variables" for your C compiler, but nowadays, the compiler is likely to do a better job than you at guessing which variables belong in registers and which do not.
- Primary cache
The primary cache is usually on the same chip with the CPU, and can be accessed extremely quickly since the memory requests do not have to go across chip boundaries. When you fetch a word to the primary cache, a "cache line" typically comes with it which includes the nearest 4 or 8 words. A typical size for the primary cache is 8K or 16K (for both instructions and data) on our smaller machines.
- Secondary cache
Secondary cache is usually on a chip placed right next to the CPU chip, and its memory is quite fast. A typical size for the secondary cache is perhaps a megabyte on the smaller machines.
- Main memory
The main memory is typically quite large. If you have a 32 megabyte machine, that means you have 32 megabytes of main memory. It's quite a bit slower than the secondary cache, and is usually accessed over the main system bus, which may be shared with I/O devices, DMA devices, and so on.
- Virtual memory
It's actually possible to use more memory than your main memory contains by making use of virtual memory. It works roughly like this: suppose you have a machine with 32 megabytes available, but you want to run a process that requires 50 megabytes of storage. 50 megabytes is set aside on the disk in chunks called "pages" that contain the program and data, only some of which are copied into the computer's main memory. When you access a page that's not in main memory, the processor faults, and the operating system checks whether the address you've touched is on a valid page. If so, it copies that page from the disk into main memory, and sets up some magic hardware stuff so that when you touch other addresses on that page, you will access the main memory that it has just "paged in".
Of course, pretty soon all your main memory will be in use, and to page in another page, the operating system will have to get rid of some other page. The one it chooses to get rid of depends on its paging algorithm, and there are heated debates on the best way to write such an algorithm, but when it does choose one, it first has to write out to the disk any changes you may have made to that page. Page sizes are different on different machines, but on ours, it is currently 4096 bytes. A typical algorithm for deciding which page to replace that's not too bad is the LRU or "least recently used" algorithm, which pages out the memory that hasn't been accessed in the longest time.
What is a cache miss? What's the penalty for missing?
A cache miss occurs when your program can't find the word it needs in a cache (at any level). The hardware is actually a bit smarter than this, but you can think of what happens as follows: a register access always hits, so these never cause a problem (on our machines, at least). The interesting stuff occurs with main memory accesses. First, the primary cache is searched for the word. If it's there, you're done. The program reads the word, and continues on its merry way. If it's not there, search the secondary cache. If found, copy it (and a few of its neighbors) into the primary cache and to the processor and continue. If it's not in the secondary cache, try to fetch it from main memory, and while you're at it, put copies in the primary and secondary caches. Et cetera.
Mashey's 10:1 rule of thumb
John Mashey has a rough rule of thumb that seems crazy, but turns out not to be too bad. He says that each time the memory has to be accessed from a slower cache, the final speed is about ten times slower than it would have been had the data been in the next faster cache.
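To make that concrete with purely illustrative numbers (not measurements): if using a word that's already in the primary cache costs about one time unit, then under the 10:1 rule the same access costs roughly 10 units when it has to come from the secondary cache, roughly 100 units when it has to come from main memory, and far more than that when the page has to be read in from disk. A loop that misses the caches on most of its accesses can therefore easily run one or two orders of magnitude slower than the same loop operating on data that fits in the primary cache.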
What happens when the cache is full?
If any cache is full, the same thing happens as when the process page faults. New data is copied into the cache, clobbering some data that's already there. If the data that's there has been modified, before it is clobbered, it will first have to be written to a slower cache. It's possible to design clever hardware that automatically tries to write modified cache entries back to slower caches if the time and bus cycles are available, but you can't be guaranteed that this will happen, and every cache miss may force the processor to write the changed data back to the slower caches.
Instruction caches and data caches
On many machines (ours included), caches for computer instructions and data are treated differently. That's because the contents of the instructions are never changed in normal operation. Obviously, the code that loads the instructions in the first place violates this idea, and so would self-modifying code, but in both cases the programmer has to do something special to modify the code's instructions (often called "text", as opposed to "data"). Thus misses to the instruction cache never require the processor to write the modified data back to the slower caches. Data caches always have to be on the lookout for this, and are consequently more expensive to implement.
Part 2: Software Considerations
Data:
- Keep heavily accessed data together
- Keep heavily modified data together
Many times, even though you declare "variables", the contents are not really variable, or at least, they're not too variable. For example, you may declare space for a table that's filled once at program initialization, and then never touched again. It's often a good idea to keep the "variable variables" together and the "fixed variables" together in memory. To see why, let's look at the worst case. Suppose you have a page (or cache line -- the size doesn't matter except that different caches will be affected) with only a single variable whose value can change, and all the rest is basically constant. Now if your cache is full, and this is the page (or cache line) that gets replaced, the whole contents will have to be written back to the slower caches if that single value has been changed. If no variables change, the data behaves just like text -- no changes have occurred, so the data in the cache can just be clobbered by the incoming data.
- In data structures
This is true in data structures: put the parts that vary together at the beginning (or end) of the data structure. Today's compilers simply lay out the data in memory in the order in which you declare it.
- In declarations
Your global variables work the same way. The compilers just tend to pack them into memory in the order you declare them. The two examples below show the right way and the wrong way to declare variables. In this example, v1, v2, ... are the "variable variables", and f1, f2, ..., are the "fixed variables":
Bad:
long v1[300];
long f1[300];
long v2[300];
long f2[300];
...
Good:
long v1[300];
long v2[300];
...
long f1[300];
long f2[300];
...
At a finer-grained level, declare variables you use frequently together in data structures so that they'll be in the same cache line. For example, suppose you have a linked data structure with a key and some data, and your program will typically search through the linked list looking for the key, and when it finally finds it, it will use the data. In this example, the link and the key will be accessed constantly, but the data will rarely be accessed. Consider the following code fragment:
struct foo {
    struct foo *next;
    long key;
    long data[20];
};

struct foo *ptr;
...
ptr = head;
while (ptr) {
    if (ptr->key == searchkey) {
        DoStuffWithData(&ptr->data);
        return;
    }
    ptr = ptr->next;
}
This isn't bad, but what if the structure had been declared as follows:
struct foo {
    struct foo *next;
    long data[20];
    long key;
};
In this case, both next and key are accessed each time through the loop, and since data is so large, you're certain to use two cache lines to get both key and next. This will exhaust the cache twice as fast as the previous example, and if the list is long enough to exhaust the primary cache, it will run (if Mashey is right) up to 10 times slower than the first form.
In general, try to keep the data in your program that's accessed most frequently close together, whether it be in data structures, in declared variables, or on the stack.
- Make sequential accesses
Similarly, when you write your code, access the data in your data structures sequentially. That way you'll tend to use all the data that's hauled in for a single cache line, rather than having to bring in multiple cache lines for the same amount of data.
For multi-dimensional arrays this can make a difference too. The arrays are laid out sequentially in memory, so if you need to march through all the entries, varying the first index fastest will take large jumps through memory, while varying the last index fastest will go through memory sequentially.
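Here's a hedged sketch in C of the two traversal orders (the array name, dimensions, and summing loops are invented for illustration); C lays a two-dimensional array out row by row, so the second version strides through memory:

#include <stddef.h>

#define ROWS 1000
#define COLS 1000

long a[ROWS][COLS];   /* laid out row by row: a[0][0], a[0][1], ... */

/* Good: the last index varies fastest, so accesses are sequential and
   every word of each cache line brought in gets used. */
long sum_sequential(void)
{
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Bad: the first index varies fastest, so successive accesses are
   COLS * sizeof(long) bytes apart and few cache lines are reused. */
long sum_strided(void)
{
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}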
- Memory fragmentation dangers
If your program does a lot of allocation and freeing of memory, depending on your (and malloc's) memory allocation strategy, the free memory in your address space may get fragmented, and your program is only actually using little pieces here and there. This may cause major paging problems.
- Traversing long lists: danger of the LRU algorithm
If your program repeatedly traverses a long linked list, and the operating system uses the LRU (least recently used) algorithm to decide what to toss out when there's a cache miss, if the list is longer than the cache, each time through the list, you'll have to completely reload the cache. Obviously, if there were some way to know that this is what you're up to, the OS would simply keep clobbering the same cache entry, and leave the rest of the cache intact in hopes of getting at least some hits after the list traversal is complete.
- Danger of deep recursion
If your program uses tail recursion instead of iteration (which is probably a bad idea algorithmically anyway), a deep recursion will use lots of stack space to store local variables, return addresses, and so on, and this may wipe out the contents of your data cache in a hurry.
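As a hedged sketch (the node type and these two routines are invented for illustration), compare a deeply recursive traversal with the equivalent loop:

#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Recursive version: every element pushes another stack frame, so a long
   list drags a large stretch of stack through the data cache. */
long sum_recursive(struct node *p)
{
    if (p == NULL)
        return 0;
    return p->value + sum_recursive(p->next);
}

/* Iterative version: the same traversal using a couple of registers and
   no extra stack traffic. */
long sum_iterative(struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}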
Code:
- Mixing error code with regular code
Mixing "unlikely" code with mainline codeGood code often does a lot of error checking. In addition, many times your code has to be prepared to handle unusual conditions. Consider the following code:
if (Error_Condition_1) {
<handle error>
}
<do next few steps>
if (Unusual_Condition_1) {
<handle unusual condition>
}
<do next few steps>
if (Error_Condition_2) {
<handle error>
}
<do next few steps>
if (Unusual_Condition_2) {
<handle unusual condition>
}
<do next few steps>
and so on. Assuming the code to handle the errors and unusual conditions is in-line, this code may cause very bad cache behavior, since the usual execution will not go into any of the error or unusual condition blocks. But these blocks will spread out the most-used code, and will cause much more paging than is necessary. An easy way to improve the situation is to make the blocks into subroutine calls so only a minimal amount of unexecuted code appears in the most used paths.
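A hedged sketch of that transformation (the condition names are carried over from the fragment above; the handler routines and the ProcessRequest wrapper are invented, and whether the compiler actually keeps the handlers out of line isn't guaranteed):

extern int Error_Condition_1, Unusual_Condition_1;   /* stand-ins for real tests */

void HandleError1(void);              /* rarely executed code lives elsewhere */
void HandleUnusualCondition1(void);

void ProcessRequest(void)
{
    if (Error_Condition_1)
        HandleError1();               /* only a call sits in the hot path */
    /* do next few steps */
    if (Unusual_Condition_1)
        HandleUnusualCondition1();
    /* do next few steps */
}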
Of course, the worst thing you can do is to leave in dead code -- it's never executed.
- Advantages and disadvantages of unrolling loops
Unrolling loops eliminates loop overhead, but the expanded code takes up more room in the instruction cache and can cause cache misses. This is a very hard problem to get right, and it depends critically on cache sizes.
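For reference, here's a hedged sketch of what unrolling looks like (the 4-way factor and the scale routine are arbitrary choices for illustration):

/* Plain loop: one test and one branch per element. */
void scale(long *a, long n, long k)
{
    for (long i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled by 4: a quarter of the loop overhead, but four times as many
   instructions competing for room in the instruction cache. */
void scale_unrolled(long *a, long n, long k)
{
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)                /* leftover elements */
        a[i] *= k;
}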
- Using cord. Emulating cord by hand
Cord tries to put basic blocks that are heavily used together so that they're more likely to be in memory when needed. My experience with the cord program is that it is broken a lot, but there are a few ways you can emulate cord by hand. If one routine, for example, always calls another, put the two routines next to each other in the source file. Similarly, put all the rarely-called error code or exceptional condition code together.
Algorithms:
- Hashing: comparison of hash collision resolution algorithms
Hash collision resolution by chaining may wander through a lot more pages than if the resolution is done by filling in adjacent slots.
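A hedged sketch of the adjacent-slot (linear probing) approach (the table size, key convention, and types are invented for illustration, and the sketch assumes the table never fills up completely):

#define TABLE_SIZE 1024

struct entry {
    long key;                         /* 0 means "empty slot" in this sketch */
    long value;
};

struct entry table[TABLE_SIZE];

/* A collision just moves to the next slot, so a probe sequence usually
   stays within one or two cache lines; with chaining, each collision
   follows a pointer to a node that may live on a different page. */
long *lookup(long key)
{
    unsigned long i = (unsigned long)key % TABLE_SIZE;
    while (table[i].key != 0) {
        if (table[i].key == key)
            return &table[i].value;
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;                         /* not found */
}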
- Sorting: possible advantages of merge sort over other techniques
- What's wrong with bubble sort/insertion sort.
For large lists, repeated sweeps through the memory are sure to cause cache catastrophes.
- Why bubble sort may beat shell sort.
At least bubble sort does stuff on one page for a while. Shell sort will cause accesses to jump wildly around the pages.
- Keeping keys and data separate.
If you've got small keys and large data, sort the keys, and keep pointers to the data. Keep the keys together.
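A hedged sketch of the idea (the record layout and the use of the standard qsort routine are illustrative, not prescriptive):

#include <stdlib.h>

struct record {
    long key;
    char payload[4096];               /* large data we don't want to move around */
};

/* Sort a compact array of {key, pointer} pairs instead of the records
   themselves: the sort sweeps over a small, dense array, and the big
   payloads are only touched once the final order is known. */
struct sortkey {
    long key;
    struct record *rec;
};

static int compare_keys(const void *a, const void *b)
{
    const struct sortkey *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

void sort_keys(struct sortkey *keys, size_t n)
{
    qsort(keys, n, sizeof keys[0], compare_keys);
}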
- Linked list traversal
This can run through a lot of memory, especially after a lot of allocation/free cycles. It's not easy to get around this without some method of double indirection to access the lists, and doing some of malloc's work yourself, but if it's important enough, it might be a reasonable thing to do. Similarly, there are cases where you might just use arrays instead of lists, particularly if you don't have to shuffle the data around a lot.
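One hedged way to do some of malloc's work yourself (the pool size and node type are invented; growing the pool and freeing nodes are left out of this sketch):

#include <stddef.h>

struct node {
    struct node *next;
    long key;
};

#define POOL_SIZE 10000

/* Hand out list nodes from one contiguous block instead of calling malloc
   for each: nodes allocated together stay together, so walking the list
   touches consecutive cache lines and pages rather than scattered ones. */
static struct node pool[POOL_SIZE];
static size_t pool_used;

struct node *new_node(void)
{
    if (pool_used == POOL_SIZE)
        return NULL;                  /* pool exhausted */
    return &pool[pool_used++];
}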
- Arrays versus lists
Arrays are at least sequential; a list's nodes can end up anywhere in memory. See above.
- Display lists (depth first versus breadth first traversal)
For typical display lists with repeated instances, breadth-first traversals may be better for caching -- in spite of the fact that the matrix operations may be a bit clumsier to handle.
As an example, consider a very simple scene in which you're drawing automobiles; each auto has 4 wheels, and each wheel has 5 bolts. Obviously, with such a tiny model, no matter what you do, you'll have a hard time having cache troubles, but imagine that the object's structure is nested much more deeply, and that the items to be drawn are complicated.
If you draw all the bodies first, then all the wheels, then the bolts, you can be sure you'll avoid caching troubles, although the "natural" ordering would be more like this:
body 1
left front wheel of body 1
bolt 1 on left front wheel of body 1
bolt 2 on left front wheel of body 1
bolt 3 on left front wheel of body 1
bolt 4 on left front wheel of body 1
bolt 5 on left front wheel of body 1
right front wheel of body 1
bolt 1 on right front wheel of body 1
bolt 2 on right front wheel of body 1
bolt 3 on right front wheel of body 1
bolt 4 on right front wheel of body 1
bolt 5 on right front wheel of body 1
...
body 2
left front wheel of body 2
... (and so on)